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Abstract 

We propose a clustering algorithm which, for input, takes data 
assumed to be sampled from a uniform distribution supported on a 
metric space X, and outputs a clustering of the data based on a topo¬ 
logical estimate of the connected components of X. The algorithm 
works by choosing a weighted graph on the samples from a natural 
one-parameter family of graphs using an error based on the heat op¬ 
erator on the graphs. The estimated connected components of X are 
identified as the support of the eigenfunctions of the heat operator with 
eigenvalue 1, which allows the algorithm to work without requiring the 
number of expected clusters as input. 


1 Introduction 

The analysis of complex, high-dimensional data is one of the major 
research challenges in contemporary computer science and statistics. 
In recent years, geometric and topological approaches to data analysis 
have begun to yield important insights into the structure of complex 
data sets (see, for instance, for an example of spectral geometry 
applied to dimension reduction, and §> i for surveys on homologi¬ 
cal methods of data analysis and visualization). The common point 
of departure of these methods is the assumption that data in high¬ 
dimensional spaces is often concentrated around a low-dimensional 
manifold or other topological space. In this note, we begin from the as¬ 
sumption that the data comes from a uniform distribution supported 
on a topologically disconnected space, and that clusters in the data 
reflect this lack of topological connectivity. 

Geometric techniques for data analysis have concentrated on ap¬ 
proximating the geometry of the data as a step toward non-linear di¬ 
mension reduction. Once the dimension is reduced, standard statistical 
techniques are then used to analyze the data in the lower-dimensional 
space. Methods in this class include ISOMAP [^, Locally Linear Em¬ 
bedding j^, Hessian Eigenmaps j^, Laplacian Eigenmaps [^, and Dif¬ 
fusion Maps 1^. Most of these techniques build a weighted graph to 
approximate the Laplace-Beltrami operator on a manifold, or else a re¬ 
lated Markov chain on a graph, and then in practice use the eigenvalues 
and eigenvectors of an approximation to the Laplacian to reduce the 
dimension of the data and then use a fc-means clustering algorithm to 
classify the data [^, [^. In j^, we And a more topologically-oriented 
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approach, in which persistent homology is used to help identify high- 
density regions of a distribution function. 

In this article, we give an algorithm that directly uses the topolog¬ 
ical information in the Laplacian and heat operator, and which also 
demonstrates the utility of considering clustering as a problem of esti¬ 
mating the number of connected components of a distribution whose 
support is disconnected. While the topological aspect of the cluster¬ 
ing problem is generally acknowledged in the topological data anal¬ 
ysis community, this is, to the best of our knowledge, the first com¬ 
pletely data-driven clustering algorithm that explicitly exploits this 
point of view. Second, the algorithm produces both the number of 
clusters and the clusters themselves, unlike popular algorithms such as 
fc-means clustering, in which the number of clusters is required as input. 
The algorithm we present also gives, we believe, the first use of cross- 
validation in a non-commutative context. While this appears formally 
like standard cross-validation, we find ourselves outside the standard 
context of commutative probability, and the usual proofs of conver¬ 
gence no longer apply. Rigorous justification of this cross-validation 
technique is an important topic for future study, but we do not ad¬ 
dress this here. Finally, we give the first algorithm for automatically 
choosing the bandwidth of the kernel function used in approximations 
to Laplace-Beltrami operators. 


2 The Heat Operator on a Family of Graphs 


We will see in Section]^ that, for data sampled from a uniform distri¬ 
bution supported on a disconnected manifold, the clustering problem 
reduces to identifying the 0-eigenfunctions of a certain graph Laplacian 
L, or, equivalently, the 1-eigenfunctions of the the heat operator 
at some t > 0. 

We now describe how to construct the family of graph Laplacians 
from which we will choose L. First, let S = C X be the points 

sampled from X. Given a subset S' C S, We define the matrix 
to be 


{Lr,S')(i,j) = 


j, 

Xi G S' 


{Xi , Xj ) 


0 


otherwise 


where K^. : X x X —>■ K is a function which satisfies 

1. Kr{x,y) = fr{d{x,y)) for some non-negative function fr 

2. Kj.{x, •) —>■ (5a;(-) as a distribution as r —0. 

3. Kr{x,y) dy,{y) = l^r £ {0,oo),x £ S. 

In our applications, will also have compact support. We will some¬ 
times use Lr.n to denote Lr,s- 
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We likewise define the corresponding heat operators by 
-tL^ s' ^ 

k\ 

k=l 

3 Topology and Clustering 

We suppose that our data is sampled from a uniform distribution which 
is supported on a disconnected subspace X embedded in a topological 
space Y. Our task is then to estimate the number of connected com¬ 
ponents of X from the samples and to decide which points belong to 
which connected component. Given the sample points S from X, we 
generate a one-parameter family of weighted graphs with vertices at 
S', with Laplace operator on M, and we appeal to several basic facts 
of Hodge theory (see, for instnace, [^, Chapter 3). 

First, we recall that the Laplace-Beltrami operator A : LF‘{M) —>■ 
L'^{M) may be defined by A = —d*d. We first recall that the dimension 
of the kernel of the Laplace-Beltrami operator A : Lp'{M) —>■ Lp'{M) is 
equal to the 0-th real Betti number /3o of M, which gives the number 
of connected components of M. Also, since A = —d*d, the functions 
in the kernel are constant on each connected component (also Lemma 
3.3.5 in [^). The following proposition now follows easily. 

Proposition 3.1. Let (M,g) be a smooth Riemannian manifold, and 
let /3o denote the 0-th Betti number of M. Let ^6 a basis 

of the kernel of the Laplace-Beltrami operator A of M, and define the 
map F : M ^ K/3o(m) 

F{x) ■.= {h{x),...,U„{x))GRf<Y 

Then the image of F consists of exactly Pq(M) points in , and 

the image of each connected component of M is a single point. 

Proof. First, we know from [^, Section 3.3 that each basis function 
is constant on each connected component of M, and it follows that 
each connected component is sent to a single point. It only remains to 
show that no two connected components are sent to the same point. 
Consider the matrix A defined by = fi{Mj). Since the fi are 
linearly independent, this is a fio{M) x fio{M) matrix of full rank. 
Suppose now that there are two connected components. Mi and M 2 , 
whose image under F is the same point x G Then two rows of 

A are the same, and the rank A < /?o(M), a contradiction. Therefore, 
all of the connected components of M are sent to different points in 

R/3o(M)^ □ 

We note, too, that Proposition |3.1| also hold for graphs and the 
graph Laplacian instead of manifolds, with a nearly identical proof. 


3 



4 Cross-Validation and Bandwidth Selec¬ 
tion 


When working with approximations to the Laplace-Beltrami operator, 
the properties of the approximation depend on the particular choice of 
kernel Kr and bandwidth r used. In practice, this parameter is gener¬ 
ally chosen on an ad hoc basis, or else based on an asymptotic formula 
with unknown constants, as, for instance, in |^. In these methods, the 
resulting graphs are connected for any given bandwidth, and the exact 
dependence of the clustering on the bandwidth is unclear. In our case, 
we use a compactly supported kernel function, so the bandwidth di¬ 
rectly dictates the connected components of the graph, and it becomes 
critical for us to have a systematic method for choosing this parameter. 

In this article, we make this choice using a variant of the so-called 
"elbow method", which is often used to estimate the number of clusters 
in k-means clustering or the number of eigenvalues to use in principal 
component analysis. The idea in this method is the following. When X 
is a manifold, would ideally like to choose a bandwidth which minimizes 
an error functional of the form E{r) := — e~^\\HSt where 

||^llffS= {Tr{A* denotes the Hilbert-Schmidt norm. Let the 
operator : L^{X) —)■ L‘^{X) be defined by 



where/r is a uniform measure on the manifold X. LetV^(r) := — 

and let B{r) := —e~^\\HS, and note the B{r) and V{r) 

are the natural bias and variance terms for this problem, respectively. 
Now, E{r) < B{r) -\- V{r) by the triangle inequality, and we estimate 
V (r) using the cross validation sum 



(As mentioned in the introduction, it is actually not obvious that this 
is an appropriate estimate for V{r), given the norm and the non¬ 
commutativity in the problem. We do not discuss this further here, 
however, but we will return to this question in future work.) The bias 
B{r) is difficult to estimate, but we may nonetheless attempt to choose 
the bandwidth tq such that increasing r beyond tq produces diminish¬ 
ing returns with respect to decreasing the variance, i.e. to find tq such 
that 




^ max ^ min 


for r > rg. We further estimate this to be the value fg which is the 
largest value of r such that 


V{r + l)-V{r-l) ^ V{rma.) - V{1) 
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where r^ax is the diameter of the data set. 


5 Compensating for estimation error 

The results in Section tell us that the points in each connected com¬ 
ponent of our graphs should be sent to exactly the same point in 
by the map F. In practice, however, there are small amounts of es¬ 
timation error in the algorithms for computing eigenvalues and eigen¬ 
vectors, and this must be accounted for when constructing the final 
clustering. We do this with a modified version of Gaussian elimination 
on the matrix formed by the eigenvectors, which we now describe. 

First, note that the j-th entry in the eigenvector fi is the value of 
the eigenfunction fi evaluated on the point Xj. Let T be the matrix 
defined by 


(^)(bi) = (/*)l 


We give a modified Gaussian elimination algorithm in Algorithm 
For what follows, let n denote the number of points in our sample. 


Algorithm 1 Modified Gaussian elimination on 'F 
1: for z = 1 to /3o(A) do 

2: Reorder columns i through n of 'F so that |'I'(iy)| is the maximum of 

l^(i,i)l in row i. 

3: Divide row i by '!'(*,*) 

4: Using elementary row operations, make ^(k,i) = 0 for k ^ i. 

5: end for 

6: Let fi := 'Lj, and (abusing notation) define the map <l>(x) := 

(/l---,//3o(X)) 


Note that the algorithm, if there was no estimation error, would 
send each point in the sample to one of the vectors Ci in the standard 
basis of Now, however, even given some numerical error, we 

are able to cluster the sample points according to how close $(a;) are 
to each of the vectors e^. 

6 Algorithm and Experiments 

We now give the complete algorithm and the results of some numerical 
experiements. 

For our experiments, the functions Kr{x,y) used are 



where B{x,r) C X are balls centered at x of radius r, and ^ is the 
standard Lebesgue measure in the ambient Euclidean space. 
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Algorithm 2 Clustering algorithm 
1: Generate N random subsamples Si of S 
2: For each r < Diam{S), compute Lr^n, Lj.^Si and 

3: Find the max f of V{r) 

4: Compute the 1-eigenvectors of (pi, i € 1... k 

5: Using algorithmic create the map <1> : x i—)• <h(x) = {(pi{x),..., (pk{x)) G 

6: Compute the distances di{x) = ||‘h(x)—ej|| for each point x in the sample. 
7: Put the point x in the i-th cluster if di{x) < dj{x) for all j ^ i. 


The following figures summarize the output of this algorithm on a 
data set of 500 points sampled with a small amount of Guassian noise 
from three circles embedded in The horizontal circle has radius 1, 
and the other two have radius 0.5 and are centered on a random point of 
the horizontal circle. The standard deviation of the noise was taken to 
be 0.05. The Figure 6.1 shows the curve giving the variance estimate, 
and figure |6.2| shows the image of the eigenfunctions, after passing 
through the modified Gaussian elimination procedure. Finally, Figure 
|6.3| shows the clustering given by the algorithm, with the different 
clusters colored in red, green, and blue. 



Figure 6.1: The Estimate of the Variance 
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Figure 6.2: The image of centered around the points (1, 0, 0), (0,1, 0), and 
( 0 , 0 , 1 ) 
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Figure 6.3: Clusters 
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